Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,kvserver: stop gossiping the system config #76279

Merged

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Feb 9, 2022

The first commit removes the system config gossip trigger and the client code to set it,
after a version gate has been passed. The primary major change is that in 22.1 binaries, we
no longer rely on the gossip data at any point, instead we adopt the systemconfigwatcher which
is backed by a rangefeed. Part of this change is to adopt that separate data source in the
GossipSubscription implementation of the Connector API.

Subsequently, a number of tests needed updating.

Remove sql.catalog.unsafe_skip_system_config_trigger.enabled
This cluster setting is no longer useful.

Fixes #54477.
Fixes #70560.

Release note (sql change): The limitation that schema change statements in a transaction could not follow
DML statements has been lifted.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/remove-system-config-gossip branch 7 times, most recently from 303f582 to ce08f27 Compare February 14, 2022 05:55
@ajwerner ajwerner marked this pull request as ready for review February 14, 2022 13:33
@ajwerner ajwerner requested a review from a team February 14, 2022 13:33
@ajwerner ajwerner requested review from a team as code owners February 14, 2022 13:33
@ajwerner ajwerner requested review from stevendanna and removed request for a team February 14, 2022 13:33
@ajwerner ajwerner force-pushed the ajwerner/remove-system-config-gossip branch from ce08f27 to dea4797 Compare February 14, 2022 13:39
@shermanCRL shermanCRL requested a review from miretskiy February 14, 2022 14:10
@ajwerner ajwerner force-pushed the ajwerner/remove-system-config-gossip branch 3 times, most recently from ccddacc to 37e04e4 Compare February 14, 2022 19:06
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 59 of 72 files at r1, 6 of 6 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @miretskiy, @nvanbenschoten, and @stevendanna)


pkg/clusterversion/cockroach_versions.go, line 274 at r3 (raw file):

	// DisableSystemConfigGossipTrigger is a follow-up to EnableSpanConfigStore
	// to disable the data propagation mechanism it and the entire spanconfig
	// infrastructure obviate.

obviates?


pkg/config/system.go, line 193 at r3 (raw file):

func (s *SystemConfig) getIndexBound(key roachpb.Key) int {
	if s == nil {
		panic("here")

Looks like debugging holdover.


pkg/sql/plan.go, line 545 at r3 (raw file):

func (p *planner) maybeSetSystemConfig(id descpb.ID) error {
	if !descpb.IsSystemConfigID(id) || p.execCfg.Settings.Version.IsActive(
		p.EvalContext().Ctx(), clusterversion.DisableSystemConfigGossipTrigger,

Nice.


pkg/kv/kvserver/client_lease_test.go, line 109 at r3 (raw file):

	zcfg := zonepb.DefaultZoneConfig()
	settings := cluster.MakeTestingClusterSettingsWithVersions(
		clusterversion.TestingBinaryMinSupportedVersion,

[pedant] Here and below: should these test be running at DisableSystemConfigGossipTrigger - 1?


pkg/kv/kvserver/client_split_test.go, line 1021 at r3 (raw file):

	ctx := context.Background()
	serv, _, _ := serverutils.StartServer(t, base.TestServerArgs{
		// This test was written with the SystemConfigSpan in mind.

There are a handful of tests that make use of DisableSpanConfigs. Some of them, like this one, were too specific to the system config span. Others were just flakey with span config asynchrony and are still being investigated by the home teams. What were you thinking to do for the latter kind?


pkg/kv/kvserver/client_split_test.go, line 3384 at r3 (raw file):

	ctx := context.Background()
	serv, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
		DisableSpanConfigs: true,

Do we still need this testing flag?

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @irfansharif, @miretskiy, @nvanbenschoten, and @stevendanna)


pkg/clusterversion/cockroach_versions.go, line 274 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

obviates?

Done.


pkg/config/provider.go, line 28 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment is now stale.

Done.


pkg/config/provider.go, line 38 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider keeping this comment.

Done.

Code quote:

// The system config will never be updated, so return a nil channel.

pkg/gossip/gossip.go, line 1151 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it worth opening an issue to track all of these follow-up tasks?

#76625


pkg/gossip/keys.go, line 73 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

In a cluster that only performs rolling upgrades and that adds information to gossip with no ttl, can we guarantee that the information won't stick around in the gossip network forever? If not, then we should probably mark this as a retired gossip key but keep it around in v22.2, instead of removing it entirely, as it would be incorrect to ever use the key again.

For now, I've adopted your suggestion, though it doesn't seem too wild to write a migration to just remove the information.


pkg/migration/migrations/descriptor_utils.go, line 60 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What allows us to remove this unconditionally?

We haven't added any system tables before the migration which enables all of the span config infrastructure. It's admittedly subtle.


pkg/server/node.go, line 1333 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a sentence about why we need to continue to support this?

Well, I said something, but it lead to me thinking more deeply about it, and I'm not pleased with the current situation. https://cockroachlabs.slack.com/archives/C020NSBH63F/p1644965222776869


pkg/server/node.go, line 1337 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you remind me what the migration story is for multi-tenant clusters? By the time we roll out a v22.2 host cluster, all SQL pods will be running v22.1?

That's correct.


pkg/kv/kvserver/client_lease_test.go, line 109 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[pedant] Here and below: should these test be running at DisableSystemConfigGossipTrigger - 1?

Done.


pkg/config/system.go, line 193 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Looks like debugging holdover.

Done.

@ajwerner
Copy link
Contributor Author

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 16, 2022

Build failed (retrying...):

@RaduBerinde
Copy link
Member

Sorry, you'll need to rebase because of #76598

bors r-

@craig
Copy link
Contributor

craig bot commented Feb 16, 2022

Canceled.

There were a pair of tests that didn't feel worth carrying along.
Follow up commit to remove the cluster setting.

Release note: None
This cluster setting is no longer useful.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/remove-system-config-gossip branch from a3d9493 to 0e45b5f Compare February 16, 2022 02:30
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 16, 2022

Build succeeded:

@craig craig bot merged commit 892c640 into cockroachdb:master Feb 16, 2022
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 27, 2022
Follow-on from cockroachdb#76279.

Fixes cockroachdb#75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Apr 11, 2022
Follow-on from cockroachdb#76279.

Fixes cockroachdb#75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Apr 11, 2022
Follow-on from cockroachdb#76279.

Fixes cockroachdb#75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 11, 2022
76504: *: fix system config for tenant r=ajwerner a=ajwerner

Follow-on from #76279.

Fixes #75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None

78595: Week -1: follow-up emails r=rail a=celiala

This PR adds follow-up emails for:

- the day before `PrepDate` (to be sent if only there are still blockers)
- the day of `PrepDate` (to be sent if only there are still blockers)
- the day after `PrepDate` (to be sent if only there are still blockers)

The `post-blockers` command should be used for all emails before and up to `PrepDate`:
```
# where `DAYS_BEFORE_PREP` is one of: many-days | day-before | day-of

SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
$(bazel info bazel-bin)/pkg/cmd/release/release_/release post-blockers \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --prep-date "Apr 1, 2022" --publish-date "May 1, 2022" \
  --days-before-prep-date $DAYS_BEFORE_PREP
```

The `cancel-release-date` should be used for the day after `PrepDate` email:
```
SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
$(bazel info bazel-bin)/pkg/cmd/release/release_/release cancel-release-date \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --publish-date "May 1, 2022" --next-publish-date "Jun 1, 2022"
```

### TeamCity changes

This will be paired with the following TeamCity jobs:
*  [Send release blockers - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_SendReleaseBlockers_78595?branch=78595&buildTypeTab=overview&mode=builds#all-projects) 
* [Cancel release date - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_CancelReleaseDate_78595?branch=78595&mode=builds#all-projects)

![image](https://user-images.githubusercontent.com/3051672/160359980-93ac438f-d8e6-4f6c-a0e0-00fab55b93bc.png)


#### Manual tests

##### Test 1: post-blockers --days-before-prep-date many-days

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353978-12e8d2e9-1900-471b-b785-f247d19f474a.png">

##### Test 2: post-blockers --days-before-prep-date day-before

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353943-8514a0d4-87f4-4ab7-8d67-78bb42f427c2.png">

##### Test 3: post-blockers --days-before-prep-date day-of

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353904-d18e6d68-594c-48dc-8308-6f0c3621fe91.png">

##### Test 4: cancel-release-date

<img width="300" src="https://user-images.githubusercontent.com/3051672/160353668-1e2886f5-c39f-4d54-884d-f967ba06a114.png">


79551: roachtest: use SQL comparison in TLP r=michae2 a=michae2

Assists: #77279

TLP compares results of two queries, one unpartitioned and one
partitioned three ways by a predicate. We were comparing TLP results by
printing values to strings. Unfortunately there are some values which
are equal according to SQL comparison, but print differently. And TLP,
being the nefarious chaos monkey that it is, tends to find these. For
example:

```sql
  0::FLOAT
  -0::FLOAT

  'hello' COLLATE "en-US-u-ks-level2"
  'HELLO' COLLATE "en-US-u-ks-level2"

  INTERVAL '1 day'
  INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN
aggregations, or are the keys for GROUP BY bucketing, it is
nondeterministic which value will be chosen. (This is also true in
PostgreSQL.) And sometimes the two TLP queries choose differently. This
is not indicative of a bug, yet TLP fails.

So this patch teaches TLP to use SQL comparison instead of string
comparison. To do so, we wrap both queries in a bigger query:

```sql
  WITH unpart AS MATERIALIZED (
    <unpartitioned query>
  ), part AS MATERIALIZED (
    <partitioned query>
  ), undiff AS (
    TABLE unpart EXCEPT ALL TABLE part
  ), diff AS (
    TABLE part EXCEPT ALL TABLE unpart
  )
  SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL
comparison detected a real mismatch) do we diff the printed results.

Release note: None

79767: util/contextutil: clarify `RunWithTimeout` error message r=erikgrinaker a=erikgrinaker

`RunWithTimeout` gave an error message of this form:

```
operation "send-snapshot" timed out after 1m (took 1m): context deadline exceeded
```

However, this can be misleading because either the caller or callee can
set their own context timeout that is less than then one set by
`RunWithTimeout`, in which case it looks like:

```
operation "send-snapshot" timed out after 1m (took 1s): context deadline exceeded
```

This is even worse on old versions, where the "took 1s" is not included,
leading the reader to believe the operation actually took 1m when it
only took 1s.

This patch clarifies the error message somewhat:

```
operation "send-snapshot" timed out after 1s (given timeout 1m): context deadline exceeded
```

Resolves #79424.

Release note: None

79787: protectedts: fix panic when fetching protectedts records r=adityamaru a=adityamaru

Previously, we incorrectly assumed that all records in the
`system.protectedts_records` table would have a non NULL target
column. While this is enforced for all protected timestamp
records written in a 22.1+ cluster, this is not true for records
that might have been prior to the 22.1 cluster version being
finalized.

This change makes the method responsible for reading protectedts
records more defensive when encountering a NULL target column.

Fixes: #79684

Release note: None

79796: Publish cockroach-sql in publish bleeding edge r=rickystewart a=rail

Previously, the `cockroach-sql` tool was published only as a part of the
release workflow.

The first commit refactors the way how we publish and test. This is very similar to what we use in publish-provisional-artifacts.

The second commit adds the cockroach-sql binary to the list of artifacts to publish.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Apr 12, 2022
Follow-on from cockroachdb#76279.

Fixes cockroachdb#75864.

This commit adds a mechanism to combine the system config data of the tenant
with the data provided over the GossipSubscription from the system tenant.
It then plumbs a version into the zone config methods. In the mixed version
state, the tenant uses the existing override from the system tenant. After
the span config infrastructure has been activated, the tenant uses the
overrides they've set. This affects, realistically, just the GC job, and,
to a lesser extent, the optimizer.

Release note: None
RichardJCai added a commit to RichardJCai/cockroach that referenced this pull request Jun 3, 2022
There is still some work left to actually remove `SystemConfigSpan`
or at least update `SystemConfigSpan` to contain just system.descriptor and
system.zones.

cockroachdb#76279

Release note: None
RichardJCai added a commit to RichardJCai/cockroach that referenced this pull request Jun 7, 2022
There is still some work left to actually remove `SystemConfigSpan`
or at least update `SystemConfigSpan` to contain just system.descriptor and
system.zones.

cockroachdb#76279

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

*: stop gossiping the system config span sql: cannot perform schema change in transaction after write
5 participants